    Factor Analysis for Multiple Testing (FAMT): An R Package for Large-Scale Significance Testing under Dependence

    The R package FAMT (factor analysis for multiple testing) provides a powerful method for large-scale significance testing under dependence. It is especially designed to select differentially expressed genes in microarray data when the correlation structure among gene expressions is strong. Indeed, this method reduces the negative impact of dependence on multiple testing procedures by modeling the common information shared by all the variables with a factor analysis structure. New test statistics for general linear contrasts are deduced, taking advantage of the common factor structure to reduce correlation and consequently the variance of error rates. Thus, the FAMT method improves on most of the usual methods regarding the non-discovery rate and the control of the false discovery rate (FDR). The steps of this procedure, each corresponding to R functions, are illustrated in this paper by two microarray data analyses. We first present how to import the gene expression data, the covariates and the gene annotations. The second step includes the choice of the optimal number of factors and the factor model fitting, and provides a list of selected genes according to a preset FDR control level. Finally, diagnostic plots are provided to help the user interpret the factors using available external information on either genes or arrays.
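
    For illustration, a minimal Python sketch of the idea behind FAMT follows: fit a low-dimensional factor structure to the residual dependence, re-test each gene with the estimated factors as extra covariates, then apply Benjamini-Hochberg selection. This is not the FAMT R package itself; the PCA-based factor estimate and all variable names are illustrative assumptions.

```python
# Sketch of factor-adjusted multiple testing (the idea behind FAMT), not the
# FAMT R package.  The simple PCA-based factor estimate is an assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, q = 30, 1000, 2            # arrays, genes, number of common factors
X = rng.normal(size=(n, p))      # expression matrix (n arrays x p genes)
y = rng.normal(size=n)           # covariate of interest (e.g. a contrast)

# 1. Ordinary per-gene regression of expression on the covariate
yc = y - y.mean()
beta = X.T @ yc / np.sum(yc ** 2)
resid = X - np.outer(yc, beta)

# 2. Estimate common factors from the residuals (here: plain PCA)
U, s, Vt = np.linalg.svd(resid - resid.mean(0), full_matrices=False)
Z = U[:, :q] * s[:q]             # n x q factor scores

# 3. Re-test each gene with the factors as extra covariates
D = np.column_stack([np.ones(n), y, Z])
pvals = np.empty(p)
for j in range(p):
    coef, res, rank, _ = np.linalg.lstsq(D, X[:, j], rcond=None)
    rss = res[0] if res.size else np.sum((X[:, j] - D @ coef) ** 2)
    sigma2 = rss / (n - D.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(D.T @ D)[1, 1])
    pvals[j] = 2 * stats.t.sf(abs(coef[1] / se), df=n - D.shape[1])

# 4. Benjamini-Hochberg selection at a preset FDR level
alpha = 0.05
order = np.argsort(pvals)
passed = np.nonzero(np.sort(pvals) <= alpha * np.arange(1, p + 1) / p)[0]
selected = order[: passed.max() + 1] if passed.size else np.array([], dtype=int)
```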

    Stability of variable selection for the classification of high-dimensional data

    High-throughput data have motivated the development of statistical methods for variable selection. These data are characterized by their high dimension and their heterogeneity, since the signal is often observed simultaneously with several confounding factors. The usual approaches are thus called into question, as they can lead to erroneous decisions. Efron (2007), Leek and Storey (2007, 2008) and Friguet et al. (2009) show the negative impact of data heterogeneity on the number of false positives in multiple testing. Variable selection is an important step in building a high-dimensional classification model, because it reduces the dimension of the problem to the most predictive variables. We focus here on the classification performance achieved by variable selection via the LASSO procedure (Tibshirani (1996)) and on the reproducibility of the selected variable sets. Simulations show that the set of variables selected by the LASSO is not the set of the best theoretical predictors. Moreover, good classification performance is only reached for a large number of selected variables. Our method relies on describing the dependence between covariates through a small number of latent variables (Friguet et al. (2009)). The proposed strategy consists in applying the procedures to the data conditionally on this dependence structure. This strategy stabilizes the selected variables: good classification performance is reached with smaller sets of variables, and the most predictive variables are detected.
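
    The strategy can be sketched in Python as follows: estimate a few latent components from the covariates, remove their contribution, and run the LASSO on the adjusted data; the selected set can then be compared with the one obtained on the raw data. The PCA step below stands in for the factor model of Friguet et al. (2009), and the simulated data and names are illustrative assumptions.

```python
# Sketch: LASSO selection on raw covariates vs. covariates adjusted for a few
# latent components capturing the common dependence.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p, q = 100, 500, 3
Z = rng.normal(size=(n, q))                      # latent confounding factors
X = Z @ rng.normal(size=(q, p)) + rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + rng.normal(size=n)    # signal carried by 5 covariates

def lasso_support(X, y):
    """Indices of the covariates kept by a cross-validated LASSO."""
    fit = LassoCV(cv=5).fit(X, y)
    return set(np.flatnonzero(fit.coef_))

raw_set = lasso_support(X, y)                    # selection ignoring the dependence

# Remove the part of X explained by the first q principal components,
# i.e. work conditionally on an estimate of the latent structure.
pca = PCA(n_components=q).fit(X)
X_adj = X - pca.inverse_transform(pca.transform(X))
adj_set = lasso_support(X_adj, y)

print(len(raw_set), len(adj_set), raw_set & adj_set)   # sizes and overlap of the two sets
```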

    Signal identification in ERP data by decorrelated Higher Criticism Thresholding

    Event-related potentials (ERPs) are intensive recordings of electrical activity along the scalp, time-locked to motor, sensory, or cognitive events. A main objective in ERP studies is to select the (rare) time points at which the (weak) ERP amplitudes (features) are significantly associated with an experimental variable of interest. Higher Criticism Thresholding (HCT), as an optimal signal detection procedure in the "rare-and-weak" paradigm, appears ideally suited for identifying ERP features. However, ERPs exhibit complex temporal dependence patterns that violate the assumptions under which HCT achieves efficient signal identification. This article first highlights the impact of dependence in terms of instability of signal estimation by HCT. A factor model for the covariance is then introduced into HCT to decorrelate the test statistics and restore stability in estimation. The detection boundary under factor-analytic dependence is derived and the phase diagram is extended accordingly. Using simulations and a real data analysis example, the proposed method is shown to estimate the support of the signal more efficiently than standard HCT and other HCT approaches based on a shrinkage estimation of the covariance matrix.
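
    A minimal Python sketch of the Higher Criticism Thresholding step is given below: two-sided p-values, the HC objective, and a feature threshold taken at its maximizer over the smallest p-values. The factor-model decorrelation of the test statistics proposed in the article is not reproduced here, and the simulated z-scores are illustrative assumptions.

```python
# Sketch of Higher Criticism Thresholding on a vector of z-scores.
import numpy as np
from scipy import stats

def hc_threshold(z, alpha0=0.1):
    """Return the HCT feature threshold (on |z|) for the z-scores `z`."""
    p = 2 * stats.norm.sf(np.abs(z))             # two-sided p-values
    n = len(p)
    order = np.argsort(p)
    p_sorted = p[order]
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p_sorted) / np.sqrt(p_sorted * (1 - p_sorted))
    k = int(np.argmax(hc[: max(1, int(alpha0 * n))]))   # search the smallest p-values
    return np.abs(z[order[k]])

rng = np.random.default_rng(2)
z = np.concatenate([rng.normal(3.0, 1.0, 20),    # a few rare, weak signals
                    rng.normal(0.0, 1.0, 2000)]) # null features
t = hc_threshold(z)
selected = np.flatnonzero(np.abs(z) >= t)        # estimated support of the signal
```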

    Inferring gene networks using a sparse factor model approach, Statistical Learning and Data Science

    The availability of genome-wide expression data to complement the measurement of a phenotypic trait opens new opportunities for identifying the biological processes and genes involved in trait expression. Differential analysis is usually a preliminary step to identify the key biological processes involved in the variability of the trait of interest. However, this variability should be viewed as resulting from a complex combination of the genes' individual contributions. In other words, exploring the interactions between genes in a network structure, whose vertices are genes and whose edges stand for inhibition or activation relationships, gives much more insight into the internal structure of expression profiles. Many solutions for network analysis are currently available, but efficient estimation of the network from high-dimensional data is still an open issue. Extending the idea introduced for differential analysis by Friguet et al. (2009) [1] and Blum et al. (2010) [2], we propose to take advantage of a factor model structure to infer gene networks. This method shows good inferential properties and also allows an efficient testing strategy for the significance of partial correlations, which provides an interesting tool to explore the community structure of the networks. We illustrate the performance of our method by comparing it with competitors in simulation experiments. Moreover, we apply our method to a lipid metabolism study that aims at identifying the gene networks underlying fatness variability in chickens.
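
    As a rough illustration, the following Python sketch derives partial correlations from a low-rank-plus-diagonal (factor-like) covariance estimate and tests each edge with a Fisher z-test. This is a generic stand-in under simplifying assumptions, not the exact estimator or testing strategy of the paper.

```python
# Sketch: partial correlations from a "factor model" covariance estimate
# (low-rank + diagonal), tested edge by edge with a Fisher z-test.
import numpy as np
from scipy import stats

def factor_covariance(X, q):
    """Low-rank (q factors) + diagonal approximation of the sample covariance."""
    S = np.cov(X, rowvar=False)
    w, V = np.linalg.eigh(S)                     # ascending eigenvalues
    low_rank = (V[:, -q:] * w[-q:]) @ V[:, -q:].T
    return low_rank + np.diag(np.clip(np.diag(S - low_rank), 1e-6, None))

def partial_correlations(Sigma):
    P = np.linalg.inv(Sigma)                     # precision matrix
    d = np.sqrt(np.diag(P))
    R = -P / np.outer(d, d)                      # rho_ij = -P_ij / sqrt(P_ii P_jj)
    np.fill_diagonal(R, 0.0)
    return R

rng = np.random.default_rng(3)
n, p, q = 200, 50, 3
X = rng.normal(size=(n, p))
R = partial_correlations(factor_covariance(X, q))

# Fisher z-test of each partial correlation (edge) against zero
z = 0.5 * np.log((1 + R) / (1 - R)) * np.sqrt(n - p - 1)
pvals = 2 * stats.norm.sf(np.abs(z))
edges = np.argwhere(np.triu(pvals < 0.01, k=1))  # candidate activation/inhibition edges
```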

    Adaptive decorrelation for high-dimensional prediction

    In large-scale significance analysis, whether or not to ignore dependence is a core issue, and many recent results address the impact of decorrelating the pointwise test statistics. Yet, for the estimation of a prediction model, the decorrelation of large profiles of predicting variables is not as clearly questioned, although many comparative studies have reported the superiority of so-called naive methods, which ignore dependence. Under the usual Gaussian mixture model assumption of Linear Discriminant Analysis, we show that, for a given dependence structure, the classification performance of methods that ignore or account for dependence may be markedly different, according to the pattern of the association signal between the predicting variables and the response. In order to minimize the largest probability of misclassification, we propose a method that handles the dependence adaptively. A simulation study shows that the performance of the proposed method is at least as good as the best of the methods that either ignore dependence or are based on a complete decorrelation of the predicting variables.
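
    A condensed Python sketch of the comparison discussed above follows: a naive diagonal-covariance rule versus a rule that accounts for the covariance of the predictors, with the retained rule chosen by cross-validated accuracy. The adaptive procedure proposed in the paper is more refined than this selection-by-cross-validation shortcut; the simulated data and names are assumptions.

```python
# Sketch: compare a "naive" diagonal-covariance classifier with one that uses
# a (regularized) full covariance of the predictors, and keep the rule with
# the better cross-validated accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
n, p = 200, 40
y = rng.integers(0, 2, size=n)                               # two classes
common = rng.normal(size=(n, 1)) @ rng.normal(size=(1, p))   # shared dependence
X = common + rng.normal(size=(n, p)) + 0.8 * y[:, None]      # class shift on every predictor

naive = GaussianNB()                                         # ignores dependence
decorrelating = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")

scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in [("naive", naive), ("decorrelating", decorrelating)]}
best_rule = max(scores, key=scores.get)                      # rule retained on these data
print(scores, best_rule)
```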

    Variable selection for correlated data in high dimension using decorrelation methods

    The analysis of high-throughput data has renewed the statistical methodology for feature selection. Such data are characterized both by their high dimension and by their heterogeneity, as the true signal and several confounding factors are often observed at the same time. In such a framework, the usual statistical approaches are called into question and can lead to misleading decisions, as they were initially designed under an assumption of independence among variables. In this talk, I will present some improvements of variable selection methods for regression and supervised classification, obtained by accounting for the dependence between selection statistics. The methods proposed in this talk are based on a factor model of the covariates, which assumes that the variables are conditionally independent given a vector of latent variables. During this talk, I will illustrate the impact of dependence on the stability of some usual selection procedures. Next, I will focus in particular on the analysis of event-related potentials (ERP) data, which are widely collected in psychological research to determine the time course of mental events. Such data are characterized by a temporal dependence pattern that is both strong and complex, and which can be modeled by the factor model mentioned above.
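
    For reference, the factor model referred to above can be written as follows (the notation is ours, not taken from the talk): conditionally on a low-dimensional vector of latent factors, the covariates are independent.

```latex
% Factor model for the covariates X = (X_1, ..., X_p):
% conditionally on the latent factors Z, the X_k are independent.
X_k = \mu_k + b_k^{\top} Z + \varepsilon_k, \qquad k = 1, \dots, p,
\qquad Z \in \mathbb{R}^q \text{ with } q \ll p,
\qquad \operatorname{Cov}(X \mid Z) = \operatorname{diag}(\psi_1, \dots, \psi_p).
```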

    A transcriptome multi-tissue analysis identifies biological pathways and genes associated with variations in feed efficiency of growing pigs

    Background - An animal's efficiency in converting feed into lean gain is a critical issue for the profitability of meat industries. This study aimed to describe shared and specific molecular responses in different tissues of pigs divergently selected over eight generations for residual feed intake (RFI). Results - Pigs from the low RFI line had an improved gain-to-feed ratio during the test period and displayed higher leanness but similar adiposity when compared with pigs from the high RFI line at 132 days of age. Transcriptomics data were generated from longissimus muscle, liver and two adipose tissues using a porcine microarray and analyzed for the line effect (n = 24 pigs per line). The most apparent effect of the line was seen in muscle, whereas subcutaneous adipose tissue was the least affected tissue. Molecular data were analyzed by bioinformatics and subjected to multidimensional statistics to identify common biological processes across tissues and key genes participating in the differences in the genetics of feed efficiency. Immune response, response to oxidative stress and protein metabolism were the main biological pathways shared by the four tissues that distinguished pigs from the low and high RFI lines. Many immune genes were under-expressed in the four tissues of the most efficient pigs. The main genes contributing to the difference between pigs from the low vs high RFI lines were CD40, CTSC and NTN1. Different genes associated with energy use were modulated in a tissue-specific manner between the two lines. The gene expression program related to glycogen utilization was specifically up-regulated in the muscle of pigs from the low RFI line (more efficient). Genes involved in fatty acid oxidation were down-regulated in muscle but promoted in the adipose tissues of the same pigs when compared with pigs from the high RFI line (less efficient). This underlines opposite line-associated strategies for energy use in skeletal muscle and adipose tissue. Genes related to cholesterol synthesis and efflux in the liver and perirenal fat were also differentially regulated in pigs from the low vs high RFI lines. Conclusions - Non-productive functions such as immunity, defense against pathogens and the response to oxidative stress likely contribute to inter-individual variations in feed efficiency.

    A factor model to analyze heterogeneity in gene expression

    Background - Microarray technology allows the simultaneous analysis of thousands of genes within a single experiment. Significance analyses of transcriptomic data ignore the gene dependence structure. This leads to correlation among the test statistics, which affects the strong control of the false discovery proportion. A recent method called FAMT allows the gene dependence to be captured by factors in order to improve high-dimensional multiple testing procedures. In the subsequent analyses aiming at a functional characterization of the differentially expressed genes, our study shows how these factors can be used both to identify the components of expression heterogeneity and to give more insight into the underlying biological processes. Results - The use of factors to characterize simple patterns of heterogeneity is first demonstrated on illustrative gene expression data sets. An expression data set primarily generated to map QTL for fatness in chickens is then analyzed. Contrary to the analysis based on the raw data, relevant functional information about a QTL region is revealed by factor-adjustment of the gene expressions. Additionally, the interpretation of the independent factors in the light of known information about both the experimental design and the genes shows that some factors may have different and complex origins. Conclusions - As biological information and technological biases are identified in what was previously considered simply as statistical noise, analyzing heterogeneity in gene expression yields a new point of view on transcriptomic data.
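
    A small Python sketch of the kind of factor interpretation described above: estimate factors from the expression matrix and test their association with a known experimental covariate (here a hypothetical batch indicator) to see which sources of heterogeneity they capture. Plain PCA is used as a stand-in for the FAMT factor estimation, and all names are illustrative assumptions.

```python
# Sketch: relate estimated expression factors to a known experimental covariate
# (e.g. batch) in order to interpret sources of heterogeneity.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p, q = 40, 800, 3
batch = rng.integers(0, 2, size=n)                 # known design covariate
X = rng.normal(size=(n, p)) + 1.5 * batch[:, None] * rng.normal(size=p)

# Estimate factors by PCA of the centred expression matrix
U, s, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
factors = U[:, :q] * s[:q]

# Test the association of each factor with the known covariate
for k in range(q):
    t, pval = stats.ttest_ind(factors[batch == 0, k], factors[batch == 1, k])
    print(f"factor {k + 1}: association with batch, p = {pval:.3g}")
```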